This tutorial walks through how to use the FinRL library to train a deep reinforcement learning (DRL) agent for stock trading.
Reinforcement learning has two core components: the agent and the environment. In short, the agent acts in an environment and learns a better policy from the environment's feedback. Concretely, the agent observes the current market state, chooses an action (how much of each stock to buy or sell), and receives from the environment the next state together with a reward signal.
The goal is that, through repeated interaction, the agent learns to take actions that maximize long-term returns under varying market conditions.
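The loop below is a minimal sketch of that interaction, using random actions and the classic Gym-style reset()/step() API; the function name run_random_episode is illustrative, and the exact return signature of FinRL's StockTradingEnv may vary between versions.

import numpy as np

def run_random_episode(env, n_stocks):
    # Roll out one episode with random actions to illustrate the agent-environment loop
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        # Stand-in for the agent's policy: one action in [-1, 1] per stock
        action = np.random.uniform(-1, 1, size=n_stocks)
        result = env.step(action)
        state, reward, done = result[0], result[1], result[2]
        total_reward += float(reward)
    return total_reward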
StockTradingEnv (Market Environment)
This step builds the stock trading environment in the OpenAI Gym style: we convert historical stock data into a reinforcement learning environment that the agent will interact with.
stock_dimension = len(train.tic.unique())
state_space = 1 + 2 * stock_dimension + len(INDICATORS) * stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
num_stock_shares = [0] * stock_dimension
env_kwargs = {
    "hmax": 100,
    "initial_amount": INIT_AMOUNT,
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4
}
e_train_gym = StockTradingEnv(df=train, **env_kwargs)
Below is a breakdown of the main parameters used to initialize StockTradingEnv:
hmax
Meaning: hmax is the maximum number of shares of each stock that can be traded in a single time step. It caps how many shares can be bought or sold per stock in one step, preventing the agent from trading too much at once.
Role in the environment: by setting hmax, you control the maximum trading volume per time step, simulating real-world volume limits or a simple risk-management constraint.
Usage in the code: in the step function, the actions are processed as follows: the agent's actions, initially in [-1, 1], are multiplied by hmax, mapping them to [-hmax, hmax], i.e. the maximum number of shares that can be bought or sold for each stock. With hmax = 100, the agent can buy or sell at most 100 shares of each stock in a single time step.
def step(self, actions):
    # Map actions from [-1, 1] to [-hmax, hmax]
    actions = actions * self.hmax  # actions initially is scaled between -1 to 1
    actions = actions.astype(int)  # cast to integers, since share counts must be whole numbers
    # ...
Example: suppose the agent outputs [0.5, -0.3, 0.1] for three stocks. Multiplying by hmax = 100 gives [50, -30, 10], meaning: buy 50 shares of the first stock, sell 30 shares of the second, and buy 10 shares of the third.
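As a standalone illustration of that scaling (NumPy only, numbers taken from the example above):

import numpy as np

hmax = 100
actions = np.array([0.5, -0.3, 0.1])        # raw agent output in [-1, 1]
share_trades = (actions * hmax).astype(int)
print(share_trades)                         # [ 50 -30  10]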
initial_amount
The total amount of cash available at initialization.
num_stock_shares
num_stock_shares is a list giving the number of shares of each stock that the agent holds when the environment is initialized. It defines the agent's starting position in each stock before trading begins.
Example: with num_stock_shares = [10, 20] and stock prices of 50 and 100, the agent starts out holding shares worth 10 × 50 + 20 × 100 = 2,500 in addition to the initial cash.
state_space
The state space (state_space) is the number of dimensions the agent observes at each time step:
state_space = 1 + 2 * stock_dimension + len(INDICATORS) * stock_dimension
Cash balance (1 dimension): stored in self.state[0].
Stock prices (stock_dim dimensions): one price per stock, stored in self.state[1 : stock_dim + 1].
Shares held (stock_dim dimensions): one holding per stock, stored in self.state[stock_dim + 1 : 2 * stock_dim + 1].
Technical indicators (len(INDICATORS) * stock_dim dimensions): each stock contributes len(INDICATORS) indicator values, so this part occupies len(INDICATORS) * stock_dim dimensions, stored in self.state[2 * stock_dim + 1 :].
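A small sketch of that layout, slicing a dummy state vector by the index ranges described above (the variable names are illustrative, not FinRL internals):

import numpy as np

stock_dim  = 3
indicators = ["macd", "rsi_30"]                                 # pretend INDICATORS has 2 entries
state_space = 1 + 2 * stock_dim + len(indicators) * stock_dim   # = 13

state = np.arange(state_space, dtype=float)                     # dummy state values

cash        = state[0]                                          # 1 dim
prices      = state[1 : stock_dim + 1]                          # stock_dim dims
holdings    = state[stock_dim + 1 : 2 * stock_dim + 1]          # stock_dim dims
tech_values = state[2 * stock_dim + 1 :]                        # len(INDICATORS) * stock_dim dims

assert len(tech_values) == len(indicators) * stock_dim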
action_space
action_space defines the dimensionality of the action the agent can take at each time step. In this environment it is set to stock_dim, the number of stocks, so the agent's action is a vector with one entry per stock describing the buy/sell operation for that stock.
Role in the environment: action_space determines the length of the action vector the agent outputs, i.e. how many stocks the agent can act on.
Example: with stock_dim = 3, action_space = 3 and the agent's action vector has length 3. An action of [0.7, -0.5, 0] means: buy the first stock (70% of hmax shares), sell the second (50% of hmax shares), and do nothing with the third.
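For reference, this is how such a per-stock continuous action space is typically declared in Gym-style environments; FinRL's internal definition may differ slightly between versions (older ones import spaces from gym rather than gymnasium):

import numpy as np
from gymnasium import spaces

stock_dim = 3
# One continuous entry in [-1, 1] per stock, matching the description above
action_space = spaces.Box(low=-1, high=1, shape=(stock_dim,), dtype=np.float32)
print(action_space.sample())   # e.g. [ 0.7 -0.5  0.0]: buy stock 1, sell stock 2, hold stock 3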
tech_indicator_list: the list of technical indicators (here INDICATORS) whose values are appended to the state for every stock.
buy_cost_pct and sell_cost_pct: the per-stock transaction cost rates applied when buying and selling; here both are set to 0.001 (0.1%) for every stock.
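A worked example of what a 0.1% cost means for a single buy (illustrative only; the exact cost formula inside StockTradingEnv may differ slightly between FinRL versions):

buy_cost_pct = 0.001        # 0.1% per buy, as set in env_kwargs
price, shares = 100.0, 50

trade_value = price * shares               # 5000.0
buy_cost    = trade_value * buy_cost_pct   # 5.0
cash_spent  = trade_value + buy_cost       # 5005.0
print(cash_spent)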
We then train the agent with DRL algorithms from the Stable Baselines 3 library; several algorithms are available, such as A2C, PPO, and SAC.
For each algorithm the steps are the same: create a DRLAgent with the training environment, build the model with get_model, train it with train_model, and save the trained model.
Below is an example of training with the A2C algorithm:
from finrl.agents.stablebaselines3.models import DRLAgent
# Set up the A2C model
agent = DRLAgent(env=env_train)
model_a2c = agent.get_model("a2c")
# Train the model
trained_a2c = agent.train_model(model=model_a2c, tb_log_name='a2c', total_timesteps=50000)
# Save the trained model
trained_a2c.save('trained_models/agent_a2c')
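To reuse the trained agent later (for backtesting, say), it can be loaded back with the matching Stable Baselines 3 class, since the file saved above is a standard SB3 archive:

from stable_baselines3 import A2C

loaded_a2c = A2C.load('trained_models/agent_a2c')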
import pandas as pd
from stable_baselines3.common.logger import configure
from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.config import INDICATORS, TRAINED_MODEL_DIR, RESULTS_DIR
from finrl.main import check_and_make_directories
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
check_and_make_directories([TRAINED_MODEL_DIR])
train = pd.read_csv('train_data.csv')
# If you are not using the data generated from part 1 of this tutorial, make sure
# it has the columns and index in a form that can be made into the environment.
# Then you can comment out and skip the following two lines.
train = train.set_index(train.columns[0])
train.index.names = ['']
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(INDICATORS)*stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
num_stock_shares = [0] * stock_dimension
env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4
}
e_train_gym = StockTradingEnv(df = train, **env_kwargs)
env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))
agent = DRLAgent(env = env_train)
# Set the corresponding values to 'True' for the algorithms that you want to use
if_using_a2c = True
if_using_ddpg = True
if_using_ppo = True
if_using_td3 = True
if_using_sac = True
##### Agent 1: A2C
agent = DRLAgent(env = env_train)
model_a2c = agent.get_model("a2c")
if if_using_a2c:
    # set up logger
    tmp_path = RESULTS_DIR + '/a2c'
    new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_a2c.set_logger(new_logger_a2c)
trained_a2c = agent.train_model(model=model_a2c,
                                tb_log_name='a2c',
                                total_timesteps=50000) if if_using_a2c else None
trained_a2c.save(TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None
##### Agent 2: DDPG
agent = DRLAgent(env = env_train)
model_ddpg = agent.get_model("ddpg")
if if_using_ddpg:
    # set up logger
    tmp_path = RESULTS_DIR + '/ddpg'
    new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ddpg.set_logger(new_logger_ddpg)
trained_ddpg = agent.train_model(model=model_ddpg,
                                 tb_log_name='ddpg',
                                 total_timesteps=50000) if if_using_ddpg else None
trained_ddpg.save(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None
##### Agent 3: PPO
agent = DRLAgent(env = env_train)
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo",model_kwargs = PPO_PARAMS)
if if_using_ppo:
    # set up logger
    tmp_path = RESULTS_DIR + '/ppo'
    new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_ppo.set_logger(new_logger_ppo)
trained_ppo = agent.train_model(model=model_ppo,
                                tb_log_name='ppo',
                                total_timesteps=200000) if if_using_ppo else None
trained_ppo.save(TRAINED_MODEL_DIR + "/agent_ppo") if if_using_ppo else None
##### Agent 4: TD3
agent = DRLAgent(env = env_train)
TD3_PARAMS = {"batch_size": 100,
              "buffer_size": 1000000,
              "learning_rate": 0.001}
model_td3 = agent.get_model("td3",model_kwargs = TD3_PARAMS)
if if_using_td3:
    # set up logger
    tmp_path = RESULTS_DIR + '/td3'
    new_logger_td3 = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_td3.set_logger(new_logger_td3)
trained_td3 = agent.train_model(model=model_td3,
                                tb_log_name='td3',
                                total_timesteps=50000) if if_using_td3 else None
trained_td3.save(TRAINED_MODEL_DIR + "/agent_td3") if if_using_td3 else None
##### Agent 5: SAC
agent = DRLAgent(env = env_train)
SAC_PARAMS = {
    "batch_size": 128,
    "buffer_size": 100000,
    "learning_rate": 0.0001,
    "learning_starts": 100,
    "ent_coef": "auto_0.1",
}
model_sac = agent.get_model("sac",model_kwargs = SAC_PARAMS)
if if_using_sac:
    # set up logger
    tmp_path = RESULTS_DIR + '/sac'
    new_logger_sac = configure(tmp_path, ["stdout", "csv", "tensorboard"])
    # Set new logger
    model_sac.set_logger(new_logger_sac)
trained_sac = agent.train_model(model=model_sac,
                                tb_log_name='sac',
                                total_timesteps=70000) if if_using_sac else None
trained_sac.save(TRAINED_MODEL_DIR + "/agent_sac") if if_using_sac else None
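The saved agents can later be loaded back with their matching Stable Baselines 3 classes for trading or backtesting; a brief sketch (the save paths match those used above):

from stable_baselines3 import A2C, DDPG, PPO, TD3, SAC

trained_a2c  = A2C.load(TRAINED_MODEL_DIR + "/agent_a2c")   if if_using_a2c  else None
trained_ddpg = DDPG.load(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None
trained_ppo  = PPO.load(TRAINED_MODEL_DIR + "/agent_ppo")   if if_using_ppo  else None
trained_td3  = TD3.load(TRAINED_MODEL_DIR + "/agent_td3")   if if_using_td3  else None
trained_sac  = SAC.load(TRAINED_MODEL_DIR + "/agent_sac")   if if_using_sac  else None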